AITopics | answer choice

Country:

South America > Uruguay > Maldonado > Maldonado (0.04)
Asia > Middle East > Republic of Türkiye > Batman Province > Batman (0.04)
North America > United States > California > Los Angeles County > Los Angeles (0.04)

Genre: Research Report (0.67)

Industry: Leisure & Entertainment > Sports > Baseball (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (0.94)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.51)

Neural Information Processing Systems | Feb-17-2026, 15:43:23 GMT

e3cdc587873dd1d00ac78f0c1f9aa60c-Supplemental-Conference.pdf

artificial intelligence, machine learning, relation, (18 more...)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.70)

Neural Information Processing Systems | Feb-17-2026, 05:41:24 GMT

Efficient Contextual LLM Cascades through Budget-Constrained Policy Learning

Recent successes in natural language processing have led to the proliferation of large language models (LLMs) by multiple providers. Each LLM offering has different inference accuracy, monetary cost, and latency, and their accuracy further depends on the exact wording of the question ( i .

large language model, machine learning, natural language, (21 more...)

Country:

North America > United States > Michigan > Washtenaw County > Ann Arbor (0.14)
North America > Mexico (0.04)
North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
Asia > Middle East > Jordan (0.04)

Genre:

Research Report > Experimental Study (0.93)
Research Report > New Finding (0.92)

Industry: Information Technology (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Neural Information Processing Systems | Feb-10-2026, 15:30:28 GMT

8bb0d291acd4acf06ef112099c16f326-Supplemental-Conference.pdf

Last Letters F 500 15.0 - Coin Flip Y 500 37.0 -

A.2.2 Dataset creation: Regarding "Last Letter Concatenation" and "Coin Flip", the datasets are not publicly available, so we created the datasets following Wei et al. [2022] with a minor rephrasing of the question template. As for Coin Flip, we use the following template.

A.5 Prompts for Answer Extraction: Table 9 and Table 10 summarize the answer extraction prompts used for the experiments in Table 1. Number: Pick up the first number encountered in the text. Multiple Choice: Pick up the first large letter encountered in the text. Yes or No: Pick up the first "yes" or "no" encountered in the text after removing unnecessary letters.

Table 13 lists example texts generated by Zero-shot-CoT for each reasoning extraction template (see Table 4). Dataset Question Answer SingleEq Q: A spaceship traveled 0.5 of a light-year from Earth to Planet X and 0.1 of a light-year from Planet X to Planet Y. A: Let's think step by step. So the total distance the spaceship traveled is 0.5 + 0.1 + 0.1 = 0.7 light-years. Therefore, the answer (arabic numerals) is: 0.7 light-years Q: While making desserts for a bake sale, Victor used 0.625 of a scoop of brown sugar as well as 0.25 of a scoop of white sugar. How much more brown sugar did Victor use? A: Let's think step by step.
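The three extraction rules quoted in this excerpt (Number, Multiple Choice, Yes or No) can be sketched in a few lines of Python. This is a minimal illustration, not the paper's code: the function names, the regular expressions, the comma stripping, and the default choice set `ABCDE` are all assumptions made for the example.

```python
import re


def extract_number(text):
    """Return the first number in the text as a string, or None.

    Assumption: thousands separators are dropped before matching.
    """
    match = re.search(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return match.group(0) if match else None


def extract_choice(text, letters="ABCDE"):
    """Return the first capital letter from the choice set, or None.

    Assumption: "large letter" means an uppercase option label.
    """
    match = re.search(f"[{letters}]", text)
    return match.group(0) if match else None


def extract_yes_no(text):
    """Return the first "yes" or "no" after stripping non-letters, or None."""
    cleaned = re.sub(r"[^a-z ]", " ", text.lower())
    for word in cleaned.split():
        if word in ("yes", "no"):
            return word
    return None


if __name__ == "__main__":
    print(extract_number("0.7 light-years"))
    print(extract_choice("(B) seems right"))
    print(extract_yes_no("No, the coin is not heads up."))
```

Taken literally, the Multiple Choice rule matches any capital letter in the choice set, so a completion that starts with a word such as "Answer" would yield "A". The sketch keeps that literal behaviour and does not try to correct for it.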

large language model, machine learning, natural language, (19 more...)

Country:

North America > United States (0.14)
North America > Mexico (0.04)
Asia > Middle East > Republic of Türkiye > Batman Province > Batman (0.04)

Industry: Consumer Products & Services > Food, Beverage, Tobacco & Cannabis (0.68)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.96)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.49)

Neural Information Processing Systems | Feb-9-2026, 10:34:04 GMT

639a9a172c044fbb64175b5fad42e9a5-Supplemental-Conference.pdf

answer choice, rationale, rationalization, (17 more...)

Country:

North America > Mexico (0.14)
North America > United States > New York (0.05)
Oceania > Australia (0.04)
(11 more...)

Industry:

Media (1.00)
Health & Medicine (1.00)
Leisure & Entertainment > Sports (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.46)

arXiv.org Artificial Intelligence | Dec-8-2025

Probing the effectiveness of World Models for Spatial Reasoning through Test-time Scaling

Jha, Saurav, Mirza, M. Jehanzeb, Lin, Wei, Yang, Shiqi, Chandar, Sarath

Vision-Language Models (VLMs) remain limited in spatial reasoning tasks that require multi-view understanding and embodied perspective shifts. Recent approaches such as MindJourney attempt to mitigate this gap through test-time scaling where a world model imagines action-conditioned trajectories and a heuristic verifier selects helpful views from such trajectories. In this work, we systematically examine how such test-time verifiers behave across benchmarks, uncovering both their promise and their pitfalls. Our uncertainty-based analyses show that MindJourney's verifier provides little meaningful calibration, and that random scoring often reduces answer entropy equally well, thus exposing systematic action biases and unreliable reward signals. To mitigate these, we introduce a Verification through Spatial Assertions (ViSA) framework that grounds the test-time reward in verifiable, frame-anchored micro-claims. This principled verifier consistently improves spatial reasoning on the SAT-Real benchmark and corrects trajectory-selection biases through more balanced exploratory behavior. However, on the challenging MMSI-Bench, none of the verifiers, including ours, achieve consistent scaling, suggesting that the current world models form an information bottleneck where imagined views fail to enrich fine-grained reasoning. Together, these findings chart the bad, good, and ugly aspects of test-time verification for world-model-based reasoning. Our code is available at https://github.com/chandar-lab/visa-for-mindjourney.
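The abstract compares verifiers by how much they reduce answer entropy but does not say how that quantity is computed. One plausible reading, sketched below as an assumption, is the Shannon entropy of the empirical distribution over answers sampled for one question. The function name `answer_entropy` and the sample answers are hypothetical.

```python
import math
from collections import Counter


def answer_entropy(answers):
    """Shannon entropy (in bits) of the empirical answer distribution.

    Assumption: `answers` holds the answers sampled for a single question;
    an empty list is treated as zero entropy.
    """
    total = len(answers)
    if total == 0:
        return 0.0
    counts = Counter(answers)
    # p * log2(1/p) is the per-answer contribution to the entropy.
    return sum((c / total) * math.log2(total / c) for c in counts.values())


if __name__ == "__main__":
    before = ["A", "B", "C", "D"]   # answers spread evenly over four options
    after = ["A", "A", "B", "B"]    # answers concentrated on two options
    print(answer_entropy(before), answer_entropy(after))
```

Lower entropy means the sampled answers agree more often. Agreement alone does not show that the answers are correct, which is consistent with the abstract's observation that random scoring can reduce entropy as well.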

artificial intelligence, reasoning, world modeling workshop 2026, (13 more...)

2512.05809

Country: North America > Canada > Quebec (0.04)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Cognitive Science > Problem Solving (0.92)

Neural Information Processing Systems | Nov-18-2025, 01:12:23 GMT

515c62809e0a29729d7eec26e2916fc0-Paper-Conference.pdf

large language model, machine learning, natural language, (21 more...)

Country:

North America > United States (0.94)
Europe > Germany > Baden-Württemberg > Tübingen Region > Tübingen (0.14)
Europe > Germany > Bavaria > Upper Bavaria > Munich (0.04)
(3 more...)

Genre:

Research Report > New Finding (1.00)
Questionnaire & Opinion Survey (1.00)
Research Report > Experimental Study (0.93)

Industry: Government (0.93)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

arXiv.org Artificial Intelligence | Nov-4-2025

Towards Transparent Reasoning: What Drives Faithfulness in Large Language Models?

McMillan, Teague, Dominici, Gabriele, Gjoreski, Martin, Langheinrich, Marc

Large Language Models (LLMs) often produce explanations that do not faithfully reflect the factors driving their predictions. In healthcare settings, such unfaithfulness is especially problematic: explanations that omit salient clinical cues or mask spurious shortcuts can undermine clinician trust and lead to unsafe decision support. We study how inference and training-time choices shape explanation faithfulness, focusing on factors practitioners can control at deployment. We evaluate three LLMs (GPT-4.1-mini, LLaMA 70B, LLaMA 8B) on two datasets, BBQ (social bias) and MedQA (medical licensing questions), and manipulate the number and type of few-shot examples, prompting strategies, and training procedure. Our results show: (i) both the quantity and quality of few-shot examples significantly impact model faithfulness; (ii) faithfulness is sensitive to prompting design; (iii) the instruction-tuning phase improves measured faithfulness on MedQA. These findings offer insights into strategies for enhancing the interpretability and trustworthiness of LLMs in sensitive domains.

large language model, machine learning, natural language, (20 more...)

2510.24236

Country:

North America > United States (0.14)
Asia > Thailand > Bangkok > Bangkok (0.04)
South America > Colombia > Meta Department > Villavicencio (0.04)
(6 more...)

Genre: Research Report > New Finding (1.00)

Industry: Health & Medicine > Therapeutic Area (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.52)

arXiv.org Artificial Intelligence | Oct-28-2025

PARTONOMY: Large Multimodal Models with Part-Level Visual Understanding

Blume, Ansel, Kim, Jeonghwan, Ha, Hyeonjeong, Chatikyan, Elen, Jin, Xiaomeng, Nguyen, Khanh Duy, Peng, Nanyun, Chang, Kai-Wei, Hoiem, Derek, Ji, Heng

Real-world objects are composed of distinctive, object-specific parts. Identifying these parts is key to performing fine-grained, compositional reasoning, yet large multimodal models (LMMs) struggle to perform this seemingly straightforward task. In this work, we introduce PARTONOMY, an LMM benchmark designed for pixel-level part grounding. We construct PARTONOMY from existing part datasets and our own rigorously annotated set of images, encompassing 862 part labels and 534 object labels for evaluation. Unlike existing datasets that simply ask models to identify generic parts, PARTONOMY uses specialized concepts (e.g., agricultural airplane), and challenges models to compare objects' parts, consider part-whole relationships, and justify textual predictions with visual segmentations. Our experiments demonstrate significant limitations in state-of-the-art LMMs (e.g., LISA-13B achieves only 5.9% gIoU), highlighting a critical gap in their part grounding abilities. We note that existing segmentation-enabled LMMs (segmenting LMMs) have two key architectural shortcomings: they use special [SEG] tokens not seen during pretraining which induce distribution shift, and they discard predicted segmentations instead of using past predictions to guide future ones. To address these deficiencies, we train several part-centric LMMs and propose PLUM, a novel segmenting LMM that uses span tagging instead of segmentation tokens and that conditions on prior predictions in a feedback loop. We find that pretrained PLUM outperforms existing segmenting LMMs on reasoning segmentation, VQA, and visual hallucination benchmarks. In addition, PLUM finetuned on our proposed Explanatory Part Segmentation task is competitive with segmenting LMMs trained on significantly more segmentation data. Our work opens up new avenues towards enabling fine-grained, grounded visual understanding in LMMs.

large language model, machine learning, segmentation, (14 more...)

2505.20759

Country:

North America > United States > California > Los Angeles County > Los Angeles (0.14)
North America > United States > Illinois > Champaign County > Urbana (0.04)
North America > United States > Florida > Miami-Dade County > Miami (0.04)
(3 more...)

Genre: Research Report > New Finding (0.46)

Industry:

Government (0.93)
Transportation > Air (0.66)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.70)
Information Technology > Artificial Intelligence > Vision (0.70)
(2 more...)

arXiv.org Artificial Intelligence | Oct-28-2025

Improving the Distributional Alignment of LLMs using Supervision

Kambhatla, Gauri, Gautam, Sanjana, Zhang, Angela, Liu, Alex, Srinivasan, Ravi, Li, Junyi Jessy, Lease, Matthew

The ability to accurately align LLMs with human population groups on subjective questions would have great value. In this work, we show that use of simple supervision can greatly improve language model alignment with diverse population groups more consistently, as measured over three datasets spanning various topics. Beyond evaluating average alignment, we also report how alignment varies across specific groups. Our broad findings provide insights into the distributional alignment of LLMs with diverse population groups. By conducting evaluation over many LLMs and prompting strategies, along with open-sourcing our work, we provide a benchmark to stimulate future research.

artificial intelligence, large language model, natural language, (16 more...)